import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Reading the 'KNN_Project_Data' CSV file into a dataframe.
data = pd.read_csv('KNN_Project_Data')
Checking the head of the dataframe.
data.head()
Since this data is artificial, we'll just do a large pairplot with seaborn.
Using seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.
sns.pairplot(data)
sns.pairplot(data,hue='TARGET CLASS',palette='coolwarm')
Time to standardize the variables.
Importing StandardScaler from scikit-learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data.columns
Fitting scaler to the features.
scaler.fit(data.drop('TARGET CLASS',axis=1))
Using the .transform() method to transform the features to a scaled version.
scaled_data = scaler.transform(data.drop('TARGET CLASS',axis=1))
# Equivalent, since 'TARGET CLASS' is the last column: scaler.transform(data.iloc[:,:-1])
scaled_data
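To make concrete what the scaler computes: StandardScaler centers each column on its mean and divides by its standard deviation (the population standard deviation, which divides by n rather than n-1). A minimal sketch in plain Python, using a made-up column of values rather than the project data; the helper name standardize is hypothetical:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (divides by n), as StandardScaler does
    return [(x - mu) / sigma for x in values]

# Hypothetical feature column, for illustration only
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(column)
print(z_scores)
```

After this transformation the column has mean 0 and standard deviation 1, which keeps any single feature from dominating the distance computation in KNN.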
Converting the scaled features to a dataframe and checking the head of this dataframe to make sure the scaling worked.
scaled_datafram = pd.DataFrame(data=scaled_data, columns= data.columns[:-1])
scaled_datafram.head()
Using train_test_split to split the data into a training set and a testing set.
from sklearn.model_selection import train_test_split
X = scaled_datafram
y = data['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
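One caveat about the order of operations above: the scaler was fitted on the full dataset before splitting, so test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below assumes synthetic data from make_classification as a stand-in for the project file, and random_state=42 is an arbitrary seed added for reproducibility; neither is part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (assumption, not the real file)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Split first; random_state makes the split repeatable, stratify keeps class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both sets
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

On artificial data like this project's the difference is usually small, but fitting on the training rows only keeps the test set untouched until evaluation.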
Importing KNeighborsClassifier from scikit-learn.
from sklearn.neighbors import KNeighborsClassifier
Creating a KNN model instance with n_neighbors=1
knn_model = KNeighborsClassifier(n_neighbors=1)
Fitting this KNN model to the training data.
knn_model.fit(X_train,y_train)
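For intuition about what the fitted model does at prediction time: with n_neighbors=1 it returns the label of the single closest training point, and for larger k it takes a majority vote among the k closest. A rough sketch in plain Python follows; knn_predict and the toy points are hypothetical, and scikit-learn's actual implementation uses faster neighbor search and its own tie-breaking.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Most common label among the k neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data (made up for illustration)
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = [0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(0.2, 0.1), k=1))
print(knn_predict(points, labels, query=(0.8, 0.9), k=3))
```

Because the prediction depends directly on distances, the standardization step earlier matters: unscaled features with large ranges would dominate the neighbor search.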
Let's evaluate our KNN model!
Using the predict method of the KNN model to predict labels for X_test.
pred_y = knn_model.predict(X=X_test)
Creating a confusion matrix and classification report.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, pred_y))
print(classification_report(y_test, pred_y))
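The numbers in the classification report can be derived from the confusion matrix. In scikit-learn's layout for binary labels 0 and 1, rows are true classes and columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]. A small sketch with made-up counts (not the output of this notebook); binary_metrics is a hypothetical helper:

```python
def binary_metrics(cm):
    """Compute precision, recall, F1 (for class 1) and accuracy from a 2x2 confusion matrix.

    Layout follows scikit-learn: cm[i][j] = count of true class i predicted as class j.
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, for illustration only
example_cm = [[40, 10], [5, 45]]
metrics = binary_metrics(example_cm)
print(metrics)
```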
Let's go ahead and use the elbow method to pick a good K Value!
Creating a for loop that trains KNN models with different k values and keeps track of the error rate of each model in a list.
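error_rate = []
# Train one model per K in 1..39 and record its test-set error rate
for i in range(1,40):
    knn_model_i = KNeighborsClassifier(n_neighbors=i)
    knn_model_i.fit(X_train,y_train)
    y_pred_i = knn_model_i.predict(X=X_test)
    # Fraction of test points whose predicted label differs from the true label
    error_rate.append(np.mean(y_pred_i != y_test))
    print('printing',error_rate[i-1])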
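The error rate in the loop relies on a NumPy idiom: comparing two arrays with != yields a boolean array, and its mean is the fraction of True entries, i.e. the share of misclassified points (1 minus accuracy). A tiny sketch with made-up labels:

```python
import numpy as np

# Hypothetical predictions and true labels, for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Element-wise comparison gives a boolean array; True counts as 1 in the mean
mismatches = y_pred_demo != y_true_demo
error = np.mean(mismatches)
accuracy = np.mean(y_pred_demo == y_true_demo)

print(error, accuracy)
```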
Plotting the error rate against the K value using the results of the for loop.
plt.plot(range(1,40),error_rate)
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate Vs K Value')
plt.xlabel('K Value')
plt.ylabel('Error Rate')
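Reading the elbow off a plot of test-set error means the test set is used to tune K, which can make the final test score optimistic. One common alternative, sketched here under assumptions, is to pick K by cross-validation on the training data only. The data is synthetic (make_classification), and the 5-fold setting and the K range 1 to 39 are illustrative choices mirroring the loop above, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

k_values = range(1, 40)
cv_error = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds; the error rate is its complement
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]
print('K with lowest cross-validated error:', best_k)
```

The lowest-error K is not always the best pick; a slightly larger K with nearly the same error often gives a smoother, more stable decision boundary.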
Retraining the model with a chosen K value (n_neighbors=15 here) and recomputing the classification report and the confusion matrix.
knn_model2 = KNeighborsClassifier(n_neighbors=15)
knn_model2.fit(X=X_train,y=y_train)
y_pred2 = knn_model2.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_pred=y_pred2,y_true=y_test))
print(confusion_matrix(y_pred=y_pred2, y_true= y_test))